{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 使用朴素贝叶斯算法对文档进行分类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "朴素贝叶斯分类最适合的场景就是文本分类、情感分析和垃圾邮件识别。其中情感分析和垃圾邮件识别都是通过文本来进行判断的。我们可以看出来，这三个场景本质上都是文本分类，这也是朴素贝叶斯最擅长的地方。所以朴素贝叶斯也是常常用于自然语言处理NPL的工具。\n",
    "\n",
    "在Python的sklearn中就集成了朴素贝叶斯分类算法，他能帮助我们快速的上手和使用这个工具。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sklearn 机器学习包"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "sklearn的全称叫做scikit-learn，它给我们提供了三种朴素贝叶斯分类算法，分别是高斯朴素贝叶斯（GaussianNB），多项目式朴素贝叶斯（MultinomialNB）和伯努利朴素贝叶斯（BernoulliNB）。\n",
    "\n",
    "这三种算法适合应用在不同的场景，应该根据特征变量的不同选择不同的算法：\n",
    "\n",
    "**高斯朴素贝叶斯：** 特征变量是连续的变量，符合高斯分布，比如说人的身高，物体的长度。\n",
    "\n",
    "**多项式朴素贝叶斯：** 特征变量是离散型的变量，符合多项式分布，在文档分类中特征变量体现在一个单词出现的次数，或者是单词的TF-IDF值等。\n",
    "\n",
    "**伯努利朴素贝叶斯：** 特征变量是布尔变量，符合0/1分布，在文档分类中特征是单词是否出现。\n",
    "\n",
    "伯努利朴素贝叶斯是以文件为粒度的，如果该单词在某文件中出现了即为1，否者为0；而多项式朴素贝叶斯是以单词为粒度，会计算在某个文件中的具体次数。而高斯朴素贝叶斯适合处理特征变量是连续变量，且符合正态分布（高斯分布）的情况。比如说身高、体重这些自然界的现象就比较适合用高斯贝叶斯来处理。而文本分类是使用多项式朴素贝叶斯或者伯努利朴素贝叶斯。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 什么是TF-IDF值呢？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如何理解“词的TF-IDF值”这个概念？\n",
    "\n",
    "TF-IDF是一个统计的方法，用来评估某个词语对于一个文件集或者文档库中的其中一份文件的重要程度。\n",
    "\n",
    "TF-IDF实际上是两个词组Term Frequency和Inverse Document Frequency的总称，两者缩写为TF和IDF，分别代表了词频和逆向文档频率。\n",
    "\n",
    "**词频TF**计算了一个单词在文档中的出现的次数，它认为一个单词的重要性和它在文档中出现的次数成正比。\n",
    "\n",
    "**逆向文档频率IDF：**是指一个单词在文档中的区分度。它认为一个单词出现在的文档数越少，就越能通过这个单词把该文档和其他文档区分开。IDF越大就代表该单词的区分度越大。\n",
    "\n",
    "所以**TF-IDF实际上是词频TF和逆向文档频率IDF的乘积。**这样我们倾向于找到TF和IDF取值都很高的单词作为区分，即这个单词在一个文档中出现的次数多，同时又和少出现在其他文档中。这样的单词适合用于分类。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TF-IDF 如何计算"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先我们看下词频TF和逆向文档概率IDF的公式\n",
    "$$ 词频TF =  \\frac{单词出现的次数}{该文档的总单词数}$$\n",
    "$$ 逆向文档频率IDF = log\\frac{文档总数}{该单词出现的文档数+1}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为什么IDF的分母中，单词出现的文档数加1？因为有些单词可能不会存在文档中，为了避免分母为零，统一给单词出现的文档数都加1\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TF-IDF = TF*IDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以看到，TF—IDF的值就是TF与IDF的成绩，这样可以更加准确的对文档进行分类。比如“我”这样的高频单词，虽然TF词频高，但是IDF值很低，整体的TF-IDF也不高。\n",
    "\n",
    "例如：假设一个文件夹里面一共有10篇文档，其中一篇文档有1000个单词，“this”这个单词出现20次，“bayes”出现5次。“this”在所有的文档中均出现过，而“bayes”只在两篇文档中出现。我们可以计算一下这两个词语的TF-IDF的值。\n",
    "\n",
    "针对“this”，计算TF-IDF值：\n",
    "$$词频TF=\\frac{20}{1000} = 0.02$$\n",
    "$$逆向文档的频率IDF=log\\frac{10}{10+1} = -0.0414$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "所以TF-IDF = 0.02*(-0.0414) = -8.28e-4\n",
    "\n",
    "针对“bayes”，计算TF-IDF值："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$词频TF=\\frac{5}{1000} = 0.005$$\n",
    "$$逆向文档的频率IDF=log\\frac{10}{2+1} = 0.5229$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TF-IDF=0.005*0.5229=2.61e-3\n",
    "\n",
    "很明显“bayes”的TF-IDF的值要大于“this”的TF-IDF值。这就说明用“bayes”这个单词做区分比单词“this”要好。\n",
    "\n",
    "**如何求TF-IDF**\n",
    "\n",
    "在sklearn中我们直接使用TfidfVectorizer类，它可以帮助我们计算单词TF-IDF向量的值。在这个类中，取到sklearn计算的对数log时，底数是e，不是10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TfidfVectorizer 类的创建："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "创建TfidfVectorizer类的方法是："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "TfidfVectorizer(stop_words=stop_words, token_pattern=token_pattern)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在创建的时候，有两个构造参数，可以自定义停用词stop_words和规律规则token_pattern。需要注意的是传递的数据结构，停用词stop_words是一个列表List类型，而过滤规则token_pattern是正则表达式。\n",
    "\n",
    "什么是停用词？停用词就是在分类中没有用的词，这些词一般词频TF高，但是IDF很低，起不到分类的作用。为了节省空间和计算时间，我们把这些词作为停用词stop word，告诉机器这些词不需要帮我计算。\n",
    "\n",
    "|参数表|作用|\n",
    "|--|--|\n",
    "|stop_words|自定义停用词表，为列表list类型|\n",
    "|token_pattern|过滤规则，正则表达式，如r\"(?u)\\b\\w+\\b\"|\n",
    "\n",
    "当我们创建好TF-IDF向量类型时候，可以使用fit_transform帮助我们计算，返回给我们文本矩阵，该矩阵表示了每一个单词在每个文档中的TF-IDF值。\n",
    "\n",
    "|方法表|作用|\n",
    "|--|--|\n",
    "|fit_transform(X)|拟合模型，并返回文本矩阵|\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在我们进行fit_transform()拟合模型之后，我们可以得到更多的TF-IDF向量属性，比如，我们可以得到词汇的对应关系（字典类型）和向量类型的IDF，当然也可以获取设置的停用词stop_words.\n",
    "\n",
    "|属性表|作用|\n",
    "|--|--|\n",
    "|vocabulary_|词汇表；字典型|\n",
    "|idf_|返回idf值|\n",
    "|stop_words_|返回停用词表|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "举个例子，假设我们有4个文档：\n",
    "\n",
    "文档1：this is the bayes document；\n",
    "\n",
    "文档2：this is the second second document；\n",
    "\n",
    "文档3:and the third one;\n",
    "\n",
    "文档4：is this the document。\n",
    "\n",
    "现在想要计算文档里都有哪些单词，这些单词在不同的文档中的TF-IDF值是多少呢？\n",
    "\n",
    "首先我们创建TfidfVectorizer类："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tfidf_vec = TfidfVectorizer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们创建4个文档的列表documents，并让创建好的tfidf_vec 对documents进行拟合，得到TF-IDF矩阵："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents=[\n",
    "    'this is the bayes document',\n",
    "\n",
    "    'this is the second second document',\n",
    "\n",
    "    'and the third one',\n",
    "\n",
    "    'is this the document'\n",
    "]\n",
    "tfidf_matrix = tfidf_vec.fit_transform(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出文档中所有不重复的词(出现了，打印出来，不能重复打印)："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('不重复的词：'，tfidf_vec.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "整合上面的代码如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "不重复的词： ['and', 'bayes', 'document', 'is', 'one', 'second', 'the', 'third', 'this']\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tfidf_vec = TfidfVectorizer()\n",
    "documents=[\n",
    "    'this is the bayes document',\n",
    "\n",
    "    'this is the second second document',\n",
    "\n",
    "    'and the third one',\n",
    "\n",
    "    'is this the document'\n",
    "]\n",
    "tfidf_matrix = tfidf_vec.fit_transform(documents)\n",
    "print('不重复的词：',tfidf_vec.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出每个单词对应的id值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "每个单词的ID： {'this': 8, 'is': 3, 'the': 6, 'bayes': 1, 'document': 2, 'second': 5, 'and': 0, 'third': 7, 'one': 4}\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tfidf_vec = TfidfVectorizer()\n",
    "documents=[\n",
    "    'this is the bayes document',\n",
    "\n",
    "    'this is the second second document',\n",
    "\n",
    "    'and the third one',\n",
    "\n",
    "    'is this the document'\n",
    "]\n",
    "tfidf_matrix = tfidf_vec.fit_transform(documents)\n",
    "print('每个单词的ID：',tfidf_vec.vocabulary_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出每个单词在每个文档中的TF-IDF值，向量里的顺序是按照词语的id顺序来的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "每个单词的tfidf值： [[0.         0.63314609 0.40412895 0.40412895 0.         0.\n",
      "  0.33040189 0.         0.40412895]\n",
      " [0.         0.         0.27230147 0.27230147 0.         0.85322574\n",
      "  0.22262429 0.         0.27230147]\n",
      " [0.55280532 0.         0.         0.         0.55280532 0.\n",
      "  0.28847675 0.55280532 0.        ]\n",
      " [0.         0.         0.52210862 0.52210862 0.         0.\n",
      "  0.42685801 0.         0.52210862]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tfidf_vec = TfidfVectorizer()\n",
    "documents=[\n",
    "    'this is the bayes document',\n",
    "\n",
    "    'this is the second second document',\n",
    "\n",
    "    'and the third one',\n",
    "\n",
    "    'is this the document'\n",
    "]\n",
    "tfidf_matrix = tfidf_vec.fit_transform(documents)\n",
    "print('每个单词的tfidf值：',tfidf_matrix.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 如何对文档进行分类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果要对文档进行分类，有两个阶段：\n",
    "\n",
    "![](文档分类.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1、**基于分词的数据准备**，包括分词、单词的权重计算、去掉停用词。\n",
    "\n",
    "2、**应用朴素贝叶斯分类进行分类**，首先通过训练集得到贝叶斯分类器，然后将分类器应用于测试集，并与实际结果做对比，最终得到测试集的分类准确率。\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模块1：对文档进行分词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在准备阶段里，最重要的就是分词。如何分词？英文文档和中文文档所使用的分词工具不同。\n",
    "\n",
    "在英文文档里面，最常用的是NTLK包。NTLK包中包含了英文的停用词stop words、分词和标注方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "word_list = nltk.word_tokenize(text) # 分词\n",
    "nltk.pos_tag(word_list) # 标注单词的词性"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在中文文档中，最常用的就是jieba包。jieba包中包含了中文的停用词stop words和分词方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba\n",
    "word_list = jieba.cut(text)  # 中文分词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模块2：加载停用词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们需要自己读取停用词表文件，从网上可以找到中文常用的停用词保存在stop_words.txt,然后利用Python的文件读取函数读取文件，保存在stop_words数组中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words = [line.strip().decode('utf-8') for line in io.open('stop_words.txt').readlines()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模块3：计算单词的权重\n",
    "\n",
    "这里我们使用到sklearn里面的TfidfVectorizer类，例如上面所介绍的使用方法。\n",
    "\n",
    "直接创建TfidfVectorizer类，然后使用fit_transform方法进行拟合，得到TF-IDF特征空间features，你可以理解为选出来的分词就是特征。我们计算这些特征在文档上面的特征向量，得到特征空间features。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf = TfidfVectorizer(stop_words=stop_words,max_df=0.5)\n",
    "features = tf.fit_transform(train_contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里max_df参数用来描述单词在文档中的最高频率。假设max_df=0.5,代表一个单词在50%的文档中都出现过，那么它就只能携带非常少的信息，因此就不做分词统计。\n",
    "\n",
    "一般很少设置min_df,因为min_df通常都会很小。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模块4：生成朴素贝叶斯分类器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们将特征训练集中的特征空间train_features,以及训练集对应的分类train_labels传递给贝叶斯分类器clf，它会自动生成一个符合特征空间和对应分类的分类器。\n",
    "\n",
    "这里我们采用的是多项式贝叶斯分类器，其中alpha为平滑参数。为什么需要平滑呢？因为如果一个单词在训练样本中没有出现，这个单词的概率就会被计算为0。但是训练集样本只是整体的抽样情况，我们不能因为一个事件没有被观察到，就认为整个事件的概率为0.为了解决整个问题，我们需要做平滑处理。\n",
    "\n",
    "当alpha=1时候，使用的是Laplace平滑。Laplace平滑就是采用加1的方式，来统计没有出现过的单词的概率。这样当训练的样本很大的时候，加1得到的概率变化可以忽略不计，也同时避免了0概率的问题。\n",
    "\n",
    "当0<alpha<1时候，使用的是Lidstone平滑。对于Lidstone平滑来说，alpha越小，迭代的次数越多，精度就越高。我们可以设置alpha为0.001."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 多项式贝叶斯分类器\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模块5：使用生成的分类器做预测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先我们需要得到测试集的特征矩阵test_features.\n",
    "\n",
    "方法是用训练集的分词创建一个TfidfVectorizer类，使用同样的stop_words和max_df,然后用这个TfidfVectorizer类对测试集的内容进行fit_transform拟合，得到测试集的特征矩阵test_features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=train_vocabulary)\n",
    "test_features = test_tf.fit_transform(test_contents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后我们使用训练好的分类器对新数据做预测。\n",
    "\n",
    "方法是使用predict函数，传入测试集的特征矩阵test_features,得到分类结果predicted_labels。predict函数做的工作就是求解所有的后验概率并找出最大的那个。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predicted_labels = clf.predict(test_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 模块6：计算准确率"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "计算准确率实际上就是对分类模型进行评估。我们可以调用sklearn中的metrics包，在metrics中提供了accuracy_score函数，方便我们对实际结果和预测结果做对比，给出模型的准确率。\n",
    "\n",
    "方法如下：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics\n",
    "print metrics.accuracy_score(test_labels,predicted_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 代码整合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "tfidf_vec = TfidfVectorizer()\n",
    "import nltk\n",
    "from sklearn.naive_bayes import MultinomialNB \n",
    "from sklearn import metrics\n",
    "\n",
    "documents = [\n",
    "    'this is the bayes document',\n",
    "    'this is the second second document',\n",
    "    'and the third one',\n",
    "    'is this the document'\n",
    "]\n",
    "tfidf_matrix = tfidf_vec.fit_transform(documents)\n",
    "# 1、对文档进行分词\n",
    "word_list = nltk.word_tokenize(documents) # 分词\n",
    "nltk.pos_tag(word_list) # 标注单词的词性\n",
    "# 2、加载停用词表\n",
    "stop_words = [line.strip().decode('utf-8') for line in io.open('stop_words.txt').readlines()]\n",
    "# 3、计算单词的权重\n",
    "tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5)\n",
    "features = tf.fit_transform(train_contents)\n",
    "# 4、生成朴素贝叶斯分类器\n",
    "# 多项式贝叶斯分类器\n",
    "clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)\n",
    "# 5、使用生成的分类器做预测\n",
    "test_tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=train_vocabulary)\n",
    "test_features=test_tf.fit_transform(test_contents)\n",
    "predicted_labels=clf.predict(test_features)\n",
    "# 6、计算准确率\n",
    "print metrics.accuracy_score(test_labels, predicted_labels)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据挖掘神器sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从数据挖掘的流程来看，一般包括了获取数据、清洗数据、模型训练、模型评估和模型部署这几个过程。\n",
    "\n",
    "sklearn中包含了大量的数据挖掘算法，比如三种朴素贝叶斯算法，我们只需要了解不同算法的适用条件，以及创建时所需要的参数，就可以用模型帮我们进行训练。在模型评估中，sklearn提供了metrics包，可以帮助我们对预测结果与实际结果进行估计。\n",
    "\n",
    "在文档分类的项目中，我们针对文档的特点，给出了基于分词的准备流程。一般来说NTLK包括用于英文文档，而jieba适用于中文文档。我们可以根据不同的包，对文档提取分词。这些分词就是贝叶斯分类中最重要的特征属性。基于这些分词，我们得到分词的权重，即特征矩阵。\n",
    "\n",
    "通过特征矩阵与分类的结果，我们就可以创建出朴素贝叶斯分类器，然后用分类器进行预测，最后预测结果与实际结果做对比即可以得到分类器在测试集上的准确率。\n",
    "\n",
    "\n",
    "### 总结\n",
    "**朴素贝叶斯分类器：**\n",
    "- sklearn工具：\n",
    "    - 三种朴素贝叶斯分类算法\n",
    "        - 高斯朴素贝叶斯：GaussianNB\n",
    "        - 多项式朴素贝叶斯：MultinomialNB\n",
    "        - 贝努力朴素贝叶斯：BernoulliNB\n",
    "    - TF-IDF\n",
    "        - 概念：词频TF，逆向文档频率IDF\n",
    "        - 使用sklearn求TF-IDF：TfidfVectorizer类\n",
    "- 如何对文档进行分类\n",
    "    - 准备阶段：对文档分词、加载停用词、计算单词的权重\n",
    "    - 分类阶段：生成分类器，分类器做预测、计算准确率"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **中文文档分类**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I)数据说明：（text_classification）\n",
    "- 1、文档总共有4个种类：女性、体育、文学、校园：\n",
    "- 2、训练集放到train文件夹中，测试集放到test文件夹中，停用词放到stop文件夹中\n",
    "\n",
    "II)目标：使用朴素贝叶斯分类对训练集进行训练并对测试集进行验证，并给出测试集的准确率。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "准确率为： 0.92\n"
     ]
    }
   ],
   "source": [
    "# 中文文本分类\n",
    "# coding: utf-8\n",
    "import os\n",
    "import jieba\n",
    "import warnings\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import metrics\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "def cut_words(file_path):\n",
    "    \"\"\"\n",
    "    对文本进行切词\n",
    "    :param file_path:txt 文本路径\n",
    "    :return:使用空格分词的字符串\n",
    "    \"\"\"\n",
    "    text_with_spaces = ''\n",
    "    # 这里会遇到字符集编码问题，所以直接采用二进制方式读取\n",
    "    text = open(file_path, 'rb').read()\n",
    "    textcut = jieba.cut(text)\n",
    "    for word in textcut:\n",
    "        text_with_spaces += word + ' '\n",
    "    return text_with_spaces\n",
    "\n",
    "def loadfile(file_dir, label):\n",
    "    \"\"\"\n",
    "    将路径下的所有文件加载\n",
    "    :param file_dir: 保存txt文件目录\n",
    "    :param label: 文档标签\n",
    "    :return: 分词后的文档列表和标签\n",
    "    \"\"\"\n",
    "    file_list = os.listdir(file_dir)\n",
    "    words_list = []\n",
    "    labels_list = []\n",
    "    for file in file_list:\n",
    "        file_path = file_dir + '/' + file\n",
    "        words_list.append(cut_words(file_path))\n",
    "        labels_list.append(label)\n",
    "    return words_list, labels_list\n",
    "\n",
    "# 训练数据\n",
    "train_words_list1, train_labels1 = loadfile('./text_classification/train/女性','女性')\n",
    "train_words_list2, train_labels2 = loadfile('./text_classification/train/体育','体育')\n",
    "train_words_list3, train_labels3 = loadfile('./text_classification/train/文学','文学')\n",
    "train_words_list4, train_labels4 = loadfile('./text_classification/train/校园','校园')\n",
    "\n",
    "train_words_list = train_words_list1 + train_words_list2 + train_words_list3 + train_words_list4\n",
    "train_labels = train_labels1 + train_labels2 + train_labels3 + train_labels4\n",
    "\n",
    "# 测试数据\n",
    "test_words_list1, test_labels1 = loadfile('./text_classification/test/女性', '女性')\n",
    "test_words_list2, test_labels2 = loadfile('./text_classification/test/体育', '体育')\n",
    "test_words_list3, test_labels3 = loadfile('./text_classification/test/文学', '文学')\n",
    "test_words_list4, test_labels4 = loadfile('./text_classification/test/校园', '校园')\n",
    "\n",
    "test_words_list = test_words_list1 + test_words_list2 + test_words_list3 + test_words_list4\n",
    "test_labels = test_labels1 + test_labels2 + test_labels3 + test_labels4\n",
    "\n",
    "stop_words = open('./text_classification/stop/stopword.txt', 'r', encoding='utf-8').read()\n",
    "stop_words = stop_words.encode('utf-8').decode('utf-8-sig') # 列表头部\\ufeff处理\n",
    "stop_words = stop_words.split('\\n') # 根据分隔符分隔\n",
    "\n",
    "# 计算单词权重\n",
    "tf = TfidfVectorizer(stop_words, max_df=0.5)\n",
    "\n",
    "train_features = tf.fit_transform(train_words_list)\n",
    "# 上面fit过了，所以这里直接transform,下面的是测试的\n",
    "test_features = tf.transform(test_words_list)\n",
    "\n",
    "# 多项式贝叶斯分类器\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)\n",
    "predicted_labels = clf.predict(test_features)\n",
    "\n",
    "# 计算准确率\n",
    "print('准确率为：', metrics.accuracy_score(test_labels, predicted_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**使用另外一种方式：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.91\n"
     ]
    }
   ],
   "source": [
    "#!/usr/bin/env python\n",
    "# -*- coding:utf8 -*-\n",
    "\n",
    "import os\n",
    "import jieba\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn import metrics\n",
    "\n",
    "# 标签映射\n",
    "LABEL_MAP = {'体育': 0, '女性': 1, '文学': 2, '校园': 3}\n",
    "# 加载停用词，使用二进制的方式进行读取\n",
    "with open('./text_classification/stop/stopword.txt', 'rb') as f:\n",
    "    STOP_WORDS = [line.strip() for line in f.readlines()]\n",
    "\n",
    "\n",
    "def load_data(base_path):\n",
    "    \"\"\"\n",
    "    :param base_path: 基础路径\n",
    "    :return: 分词列表，标签列表\n",
    "    \"\"\"\n",
    "    documents = []\n",
    "    labels = []\n",
    "    \n",
    "    # root 作为根路径\n",
    "    for root, dirs, files in os.walk(base_path): # 循环所有文件并进行分词打标\n",
    "        for file in files:\n",
    "            label = root.split('/')[-1] # 直接截取倒数第一个斜杠\n",
    "            labels.append(label)\n",
    "            filename = os.path.join(root, file)\n",
    "            with open(filename, 'rb') as f: # 因为字符集问题因此直接用二进制方式读取\n",
    "                content = f.read()\n",
    "                word_list = list(jieba.cut(content))\n",
    "                words = [wl for wl in word_list]\n",
    "                documents.append(' '.join(words))\n",
    "    return documents, labels\n",
    "\n",
    "\n",
    "def train_fun(td, tl, testd, testl):\n",
    "    \"\"\"\n",
    "    构造模型并计算测试集准确率，字数限制变量名简写\n",
    "    :param td: 训练集数据\n",
    "    :param tl: 训练集标签\n",
    "    :param testd: 测试集数据\n",
    "    :param testl: 测试集标签\n",
    "    :return: 测试集准确率\n",
    "    \"\"\"\n",
    "    # 计算矩阵\n",
    "    tt = TfidfVectorizer(stop_words=STOP_WORDS, max_df=0.5)\n",
    "    tf = tt.fit_transform(td)\n",
    "    # 训练模型\n",
    "    clf = MultinomialNB(alpha=0.001).fit(tf, tl)\n",
    "    # 模型预测\n",
    "    test_tf = TfidfVectorizer(stop_words=STOP_WORDS, max_df=0.5, vocabulary=tt.vocabulary_)\n",
    "    test_features = test_tf.fit_transform(testd)\n",
    "    predicted_labels = clf.predict(test_features)\n",
    "    # 获取结果\n",
    "    x = metrics.accuracy_score(testl, predicted_labels)\n",
    "    return x\n",
    "\n",
    "\n",
    "# text classification与代码同目录下\n",
    "train_documents, train_labels = load_data('./text_classification/train')\n",
    "test_documents, test_labels = load_data('./text_classification/test')\n",
    "x = train_fun(train_documents, train_labels, test_documents, test_labels)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
